<img src="images/churn_image.jpeg" alt="Churn Image">
# Telecom Churn Analysis: How To Keep Your Customers "On the Line"

---

Authors: Jared Mitchell, Andrew Marinelli, Wes Newcomb
## Overview

---

In this notebook, we analyze and build classification models with data from SyriaTel, a telecom company, in an effort to understand the relationships and patterns between several customer variables and customer churn. After cleaning and encoding the data, we take an iterative and comparative approach to model production, eventually converging on a robust classification model that can determine with sufficient power the likelihood that a given customer will churn.
## Business Understanding

---

Churn has long been king for companies wishing to determine the success of their product. Intuitively, customers wouldn't drop your service if they liked it, right? According to churn expert Patrick Campbell, "Your churn rate is a direct reflection of the value of the product and features that you're offering to customers." Further, when churn is combined with other features of your service, such as cost, we can determine the price at which the offered service becomes most profitable. The idea is that we're willing to lose some customers to an increased cost of service as long as the double bottom line grows as a result.

Thus, the question is born: Is there a way that we can predict churn on a client-by-client basis, so that we can shift from a <b>reactive</b> to a <b>proactive</b> approach to business decisions with respect to items such as product feature implementations, customer service operations, retention campaigns, and pricing optimization? The short answer is yes; armed with a predictive model, SyriaTel can not only make its service better, but it can also increase its profits.
## Data Exploration

---

The SyriaTel dataset consists of 21 columns and 3333 rows. Each row represents information about a unique account holder. The dataset was complete and consistent for all rows upon our reception of it. The time period that this dataset represents is not clear.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px  # Be sure to import express
%matplotlib inline
```

```python
df = pd.read_csv('data/telecom_customer_churn.csv')
df.head()
```

The information provided per client includes how long they've been with SyriaTel in months (account length); which plans they are signed up for (international plan, voice mail plan); usage metrics (total day minutes, total night charge); the number of calls they made to customer support; and, of course, churn status.
```python
df.info()
```

```python
df.describe()
```

```python
df['churn'].value_counts()
```

We are dealing with an imbalanced dataset, which means that we will have to be careful in applying proper weights to our outcome groups. However, the imbalance is not so extreme as to require measures that hedge against the potential skewing of results.
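A quick way to quantify the imbalance is to normalize the class counts. The sketch below uses synthetic labels with the approximate 85/15 split seen above, not the dataset itself:

```python
import pandas as pd

# Synthetic labels mirroring the roughly 85/15 non-churn/churn split
# observed above (illustrative only, not the SyriaTel data itself)
churn = pd.Series([0] * 2850 + [1] * 483)
proportions = churn.value_counts(normalize=True)

# A model that always predicts "no churn" achieves this accuracy,
# which is why raw accuracy is a misleading baseline metric here
majority_share = proportions.max()
```

This majority share (roughly 0.85) is exactly the accuracy a do-nothing baseline earns, which motivates the metric discussion in the Modeling section.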
Some of the columns were immediately identifiable as not relevant to predicting churn, such as phone number and area code. We dropped these columns from the dataset outright. One might immediately think that area code is connected to customer region. However, some people (perhaps a majority) keep their numbers when they move, so someone with a San Francisco phone number could very well be living in South Dakota. We can safely conclude that area code does not contain robust customer information, and because phone numbers are semi-randomly generated, the same can be said for the phone number column.
```python
# Drop columns of little importance to determining churn, as judged by
# the fact that they are arbitrarily assigned by the telecom company
df = df.drop(['area code', 'phone number'], axis=1)
```

Some of the columns needed to be reformatted from yes/no to binary, 1/0 style.
```python
df['churn'] = df['churn'].astype(int)
df['international plan'] = df['international plan'].map(lambda x: 1 if x == 'yes' else 0)
df['voice mail plan'] = df['voice mail plan'].map(lambda x: 1 if x == 'yes' else 0)
```

Further, the successes of some telecom companies are region-specific: perhaps they offer great coverage in some regions, but terrible coverage in others. However, this is not the case for SyriaTel, as they do not have characteristic regional customer counts or churn rates. We can see this from the following visual representations.
```python
states_df = pd.DataFrame(df.state.value_counts()).reset_index()
states_df = states_df.rename(columns={'index': 'state', 'state': 'value_count'})
states_df = states_df.sort_values('state')
states_df = states_df.merge(df.groupby(['state'])['churn'].mean(), on='state')
states_df['churn'] = states_df['churn'] * 100
```

```python
fig = px.choropleth(states_df,                  # Input Pandas DataFrame
                    locations='state',          # DataFrame column with locations
                    color='value_count',        # DataFrame column with color values
                    hover_name='state',         # DataFrame column hover info
                    locationmode='USA-states',  # Set to plot as US states
                    labels={'value_count': 'Number of Clients'},
                    color_continuous_scale=px.colors.sequential.Blues)
fig.update_layout(
    title_text='State Rankings By Customer Count',  # Create a title
    geo_scope='usa',  # Plot only the USA instead of the globe
)
fig.show()
```

```python
fig = px.choropleth(states_df,                  # Input Pandas DataFrame
                    locations='state',          # DataFrame column with locations
                    color='churn',              # DataFrame column with color values
                    hover_name='state',         # DataFrame column hover info
                    locationmode='USA-states',  # Set to plot as US states
                    labels={'churn': 'Churn Rate (%)'},
                    color_continuous_scale=px.colors.sequential.Oranges)
fig.update_layout(
    title_text='State Rankings by Churn Rate',  # Create a title
    geo_scope='usa',  # Plot only the USA instead of the globe
)
fig.show()
```

There is a significant difference in churn rate by state, which we can see in the bar chart below.
```python
gb = df.groupby(['state'])['churn'].mean() * 100
gb.sort_values().plot(kind='bar', figsize=(12, 12))
plt.title('Churn Rate By State', fontsize=14)
plt.ylabel('Churn Rate (%)')
plt.xlabel('State')
plt.show()
```

While some areas do have significantly higher churn rates than others, we cannot take this as characteristic of the state customer populations because the populations by state are small: if we only have data on 30 customers from California, we cannot be confident that the statistics on those 30 customers represent all the customers in California. Finally, we can say with relative certainty that the regional representations in the dataset do not characterize the churn rates: states with a larger representation are not, generally speaking, more likely to have higher or lower churn rates than states with smaller representation.
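The small-sample caveat can be made concrete with a back-of-the-envelope calculation: for a state with only 30 customers, the standard error of an estimated churn rate is large. The numbers below are illustrative, not drawn from the dataset:

```python
import math

# Standard error of a proportion estimated from a small sample
# (hypothetical observed churn rate and sample size, for illustration)
p_hat, n = 0.15, 30
se = math.sqrt(p_hat * (1 - p_hat) / n)
margin_95 = 1.96 * se  # approximate 95% confidence half-width
```

With a half-width of roughly 13 percentage points, a 15% observed rate is statistically indistinguishable from rates anywhere between about 2% and 28%, which is why we do not treat per-state rates as characteristic.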
```python
states_df.corr()
```

As a result, we can drop the state column from the dataset.
```python
df = df.drop(['state'], axis=1)
```

We do have some columns that are nearly perfectly collinear as well:
```python
sns.heatmap(df.corr().abs())
plt.title('Correlations Between Telecom Variables', fontsize=14)
plt.show()
```

Since the number of user minutes per time period is a more direct metric than the charge for the corresponding time period, and because the number of voicemail messages and the total international charge are directly consistent with whether a customer has the corresponding plan, we can safely drop those columns from the dataset.
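The near-perfect collinearity between the minutes and charge columns follows from billing at a flat per-minute rate: one column is a constant multiple of the other. A minimal synthetic sketch (the 0.17 rate is hypothetical):

```python
import numpy as np

# Synthetic sketch: if charge = rate * minutes, the two columns carry
# identical information and their correlation is exactly 1
rng = np.random.default_rng(0)
minutes = rng.uniform(0, 300, size=100)
charge = 0.17 * minutes  # hypothetical flat per-minute rate
corr = np.corrcoef(minutes, charge)[0, 1]
```

Because either column can be reconstructed from the other, dropping the charge columns loses no information while removing redundant features.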
```python
df = df.drop(['number vmail messages', 'total day charge', 'total eve charge',
              'total night charge', 'total intl charge'], axis=1)
```

Train-test split, <i>et voilà</i>! The dataset is clean and ready for production.
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

X_cols = df.drop(['churn'], axis=1).columns
y_cols = ['churn']
X = df.drop(['churn'], axis=1)
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
```

## Modeling

---

We took an iterative approach to modeling the data: starting with a baseline, then progressing through the simplest models, and finally exploring more advanced models.

We determined that a customized F-style score is the most appropriate metric for measuring the success of our model because (1) we are dealing with imbalanced data, and thus a skewed high baseline accuracy; and (2) we are interested in a healthy medium between identifying customers who are going to churn and misidentifying customers who are not going to churn. You can understand this simply: each customer we save from churn yields a large increase in revenue, whereas every customer we misidentify as churning yields a small decrease in revenue. Thus we decided to optimize our models against the F4-score: a scoring metric that quadruple-weights recall with respect to precision. We do this because we understand that the cost of losing a customer is far greater than the cost of accidentally being overly generous with a customer whom we were not going to lose in the first place.
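The F-beta family makes this weighting explicit: F_beta = (1 + beta^2) * p * r / (beta^2 * p + r), where beta = 4 treats recall as four times as important as precision. A toy check against scikit-learn's `fbeta_score`, using synthetic labels rather than model output:

```python
from sklearn.metrics import fbeta_score

# Toy labels (not model output) to verify the F-beta formula with beta = 4
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]  # precision = 2/3, recall = 1/2

p, r, beta = 2 / 3, 1 / 2, 4
f4_manual = (1 + beta**2) * p * r / (beta**2 * p + r)  # F-beta formula
f4_sklearn = fbeta_score(y_true, y_pred, beta=4)
```

Note how the heavy recall weighting pulls the score toward recall: with precision 2/3 and recall 1/2, F1 would be about 0.57, while F4 lands near 0.51.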
```python
# Import all necessary packages for modeling
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import Normalizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, precision_score, \
    accuracy_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import fbeta_score, make_scorer
```

```python
f4_score = make_scorer(fbeta_score, beta=4)
```

Logically, we should start with the current state of the company. SyriaTel does not have any way of predicting customer churn at the moment; in other words, SyriaTel treats every customer as though they were not going to churn. Our baseline model reflects this stratagem.
```python
dummy = DummyClassifier(strategy='most_frequent', random_state=42)
dummy.fit(X_train, y_train)
cv = cross_val_score(dummy, X_train, y_train, scoring='f1')
cv
```

```python
y_pred_dummy = dummy.predict(X_test)
print(classification_report(y_test, y_pred_dummy, zero_division=0))
plot_confusion_matrix(estimator=dummy, X=X_test, y_true=y_test);
```

Clearly, we are starting from zero here: our F4-score is 0, even though we do have 85% accuracy.
We decided that Logistic Regression would make a good starting point, as it is relatively simple to understand and easy to implement. Our approach iteratively added features by feature importance, as judged by each feature's relative ability to determine our target variable. Additionally, we tuned our Logistic Regression hyperparameters over sufficient parameter space to say that this is the best logistic regression model obtainable, given certain time and complexity constraints. We also experimented with different data preprocessing techniques.
```python
lr_pipe = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('rfe', RFE(estimator=LogisticRegression(random_state=42))),
    ('lr', LogisticRegression(random_state=42))
])
lr_grid = {'scaler': [MinMaxScaler(), StandardScaler()],
           'rfe__n_features_to_select': list(range(1, 12)),
           'lr__tol': [.01, .0001, .000001],
           'lr__C': [100, 10, 1, .1, .01, .001],
           'lr__class_weight': [None, 'balanced']}
```

```python
lr_gs = GridSearchCV(estimator=lr_pipe,
                     param_grid=lr_grid,
                     cv=RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42),
                     scoring=f4_score)
lr_gs.fit(X_train, np.ravel(y_train))
lr_best = lr_gs.best_estimator_
```

```python
y_train_pred_lr = lr_best.predict(X_train)
y_test_pred_lr = lr_best.predict(X_test)
print('Train Statistics')
print(classification_report(y_train, y_train_pred_lr))
print()
print('Test Statistics')
print(classification_report(y_test, y_test_pred_lr))
plot_confusion_matrix(estimator=lr_best, X=X_train, y_true=y_train);
plt.title('Train');
plot_confusion_matrix(estimator=lr_best, X=X_test, y_true=y_test);
plt.title('Test');
```

This model shows a significant improvement over the baseline, with an F1-score of 0.47 while only sacrificing 8 points of accuracy. Additionally, it does not appear that we are overfitting.
We also took an iterative approach to creating our kNN model, incorporating features into our model with respect to their correlation with the target variable, churn; however, we do not contend that the feature combination we found to produce the best model is necessarily the best combination. Feature selection in kNNs can be particularly challenging, and even the most modern methods are not comprehensive.
```python
corrs = df.corr().abs()['churn'].sort_values(ascending=False).drop('churn')
ordered_corrs = list(corrs.index)
```

```python
knn_pipe = Pipeline([('scaler', StandardScaler()),
                     ('normalizer', Normalizer()),
                     ('knn', KNeighborsClassifier())])
knn_grid = {
    'scaler': [None, StandardScaler(), MinMaxScaler()],
    'normalizer': [None, Normalizer()],
    'knn__n_neighbors': [1, 3, 7, 11, 17],
    'knn__p': [1, 2, 3],
    'knn__weights': ['uniform', 'distance']
}
```

```python
models = []
num_features = []
train_preds = []
test_preds = []
for i in range(1, len(ordered_corrs)):
    features = ordered_corrs[:i]
    X_train_knn = X_train[features]
    X_test_knn = X_test[features]
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
    gs_knn = GridSearchCV(estimator=knn_pipe, param_grid=knn_grid, cv=cv, scoring=f4_score)
    gs_knn.fit(X_train_knn, np.ravel(y_train))
    y_pred_train = gs_knn.predict(X_train_knn)
    y_pred_test = gs_knn.predict(X_test_knn)
    models.append(gs_knn.best_estimator_)
    num_features.append(str(i))
    train_preds.append(y_pred_train)
    test_preds.append(y_pred_test)
```

```python
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
for i in range(len(num_features)):
    y_train_pred = train_preds[i]
    y_test_pred = test_preds[i]
    accuracy_scores.append(accuracy_score(y_test, y_test_pred))
    precision_scores.append(precision_score(y_test, y_test_pred))
    recall_scores.append(recall_score(y_test, y_test_pred))
    f1_scores.append(f1_score(y_test, y_test_pred))
```

We can see from the graphic below that we achieve our maximum F1-score at 8 features, while maintaining healthy accuracy, recall, and precision levels.
```python
plt.plot(num_features, precision_scores, label='Precision')
plt.plot(num_features, accuracy_scores, label='Accuracy')
plt.plot(num_features, recall_scores, label='Recall')
plt.plot(num_features, f1_scores, label='F1')
plt.legend()
plt.show()
```

```python
ss = StandardScaler()
normalizer = Normalizer()
knn_best = KNeighborsClassifier(n_neighbors=3, p=1, weights='distance')
X_train_knn = ss.fit_transform(X_train[ordered_corrs[:8]])
X_test_knn = ss.transform(X_test[ordered_corrs[:8]])
X_train_knn = normalizer.transform(X_train_knn)
X_test_knn = normalizer.transform(X_test_knn)
knn_best.fit(X_train_knn, y_train)
```

```python
y_train_pred_knn = knn_best.predict(X_train_knn)
y_test_pred_knn = knn_best.predict(X_test_knn)
print(classification_report(y_train, y_train_pred_knn))
print(classification_report(y_test, y_test_pred_knn))
plot_confusion_matrix(knn_best, X=X_train_knn, y_true=y_train);
plt.title('Train');
plot_confusion_matrix(knn_best, X=X_test_knn, y_true=y_test);
plt.title('Test');
```

We see a significant improvement in both overall accuracy and F1-score with respect to the logistic regression model. Not bad! However, we may be overfitting, which we can see from the dramatic differences between train and test scores across various metrics. To resolve this, we will remove some of the input features to reduce model complexity.
Naive Bayes offers another simple approach to predicting churn. We chose not to incorporate feature selection here because Naive Bayes usually benefits from more features, provided the number of features is not excessively large (on the order of hundreds).
```python
gnb_pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('gnb', GaussianNB())
])
gnb_grid = {'scaler': [None, StandardScaler()],
            'gnb__var_smoothing': [1e-12, 1e-9, 1e-6, 1e-3]}
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
gs_gnb = GridSearchCV(estimator=gnb_pipe, param_grid=gnb_grid, scoring=f4_score, cv=cv)
gs_gnb.fit(X_train, y_train)
gnb_best = gs_gnb.best_estimator_
```

```python
y_train_pred_gnb = gnb_best.predict(X_train)
y_test_pred_gnb = gnb_best.predict(X_test)
print('Train')
print(classification_report(y_train, y_train_pred_gnb))
print()
print('Test')
print(classification_report(y_test, y_test_pred_gnb))
plot_confusion_matrix(gnb_best, X_train, y_train);
plt.title('Train');
plot_confusion_matrix(gnb_best, X_test, y_test);
plt.title('Test');
```

This may not be our best-performing model, but its results are still better than our baseline. Additionally, we do not have evidence of overfitting, as the test evaluation metrics closely reflect the train evaluation metrics.
Our final simple model, the decision tree, is powerful because of its bare-bones preprocessing requirements. There is no concept of recursive feature elimination for decision trees, but we still took a grid search approach to determining the best hyperparameters.
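The "bare-bones preprocessing" point can be illustrated directly: tree splits depend only on the ordering of feature values, so monotonic rescaling leaves the fitted predictions unchanged. A small synthetic sketch (the data and scale factor are hypothetical):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends only on whether feature 0 exceeds 50
rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(200, 3))
y = (X[:, 0] > 50).astype(int)

# Fitting on raw vs. rescaled features yields identical predictions,
# because rescaling preserves the value ordering that splits depend on
tree_raw = DecisionTreeClassifier(random_state=42).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=42).fit(X * 1000, y)
same = (tree_raw.predict(X) == tree_scaled.predict(X * 1000)).all()
```

This is why our decision tree pipeline below omits the scaler that the other models required.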
```python
dt_pipe = Pipeline(steps=[('dt', DecisionTreeClassifier(random_state=42))])
dt_grid = {
    'dt__criterion': ['gini', 'entropy'],
    'dt__max_depth': list(range(2, 16)),
    'dt__min_samples_split': list(range(2, 12)),
    'dt__min_samples_leaf': list(range(5, 25, 4)),
    'dt__class_weight': [None, 'balanced']
}
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
gs_dt = GridSearchCV(estimator=dt_pipe, param_grid=dt_grid, scoring=f4_score, cv=cv)
gs_dt.fit(X_train, y_train);
dt_best = gs_dt.best_estimator_
```

```python
y_train_pred_dt = dt_best.predict(X_train)
y_test_pred_dt = dt_best.predict(X_test)
print('Train')
print(classification_report(y_train, y_train_pred_dt))
print()
print('Test')
print(classification_report(y_test, y_test_pred_dt))
plot_confusion_matrix(dt_best, X_train, y_train);
plt.title('Train');
plot_confusion_matrix(dt_best, X_test, y_test);
plt.title('Test');
```

Once again, we are beating our baseline; but there is a good chance that we are overfitting, as evinced by the mismatch between train and test results.
## Complex Models

---

In this section, we explore the results of more complex models, which are generally capable of better predictions, but at the cost of higher computational complexity. Note that we did not choose these models for any particular reason, though our motivation for XGBoost comes from its fame within the industry.
We expect that the random forest will perform well, and probably better than any of our simple models.
```python
rf_pipe = Pipeline(steps=[('rf', RandomForestClassifier(random_state=42))])
rf_grid = {
    'rf__n_estimators': list(range(10, 101, 10)),
    'rf__class_weight': [None, 'balanced'],
    'rf__max_depth': list(range(1, 11))
}
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
rf_gs = GridSearchCV(estimator=rf_pipe, param_grid=rf_grid, cv=cv, scoring=f4_score)
rf_gs.fit(X_train, np.ravel(y_train))
rf_best = rf_gs.best_estimator_
```

```python
y_train_pred_rf = rf_best.predict(X_train)
y_test_pred_rf = rf_best.predict(X_test)
print('Train')
print(classification_report(y_train, y_train_pred_rf))
print()
print('Test')
print(classification_report(y_test, y_test_pred_rf))
plot_confusion_matrix(rf_best, X_train, y_train);
plt.title('Train');
plot_confusion_matrix(rf_best, X_test, y_test);
plt.title('Test');
```

It looks as though our random forest has given us our best results thus far. However, it may be overfitting somewhat, so we need to tweak a few of the parameters to ensure that our model does not overfit to the train data.
And finally, the holy grail of machine learning classification models: XGBoost. Because of its track record as a superstar algorithm amongst classification models, we expect that XGBoost will top all of our models thus far by a decent margin.
```python
from xgboost import XGBClassifier
```

```python
xgb_pipe = Pipeline(steps=[('scaler', StandardScaler()),
                           ('xgb', XGBClassifier())])
xgb_grid = {
    'xgb__learning_rate': [0.1, 0.2],
    'xgb__max_depth': [2, 5],
    'xgb__min_child_weight': [1, 2],
    'xgb__subsample': [0.2, 0.5],
    'xgb__n_estimators': [10, 65, 100],
    'xgb__colsample_bytree': [.2, .5],
    'xgb__scale_pos_weight': [.2, .4]
}
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
gs_xgb = GridSearchCV(estimator=xgb_pipe, param_grid=xgb_grid, scoring=f4_score, cv=cv, n_jobs=16)
gs_xgb.fit(X_train, y_train)
xgb_best = gs_xgb.best_estimator_
```

```python
y_train_pred_xgb = gs_xgb.predict(X_train)
y_test_pred_xgb = gs_xgb.predict(X_test)
print('Train')
print(classification_report(y_train, y_train_pred_xgb))
print()
print('Test')
print(classification_report(y_test, y_test_pred_xgb))
plot_confusion_matrix(gs_xgb, X_train, y_train);
plt.title('Train');
plot_confusion_matrix(gs_xgb, X_test, y_test);
plt.title('Test');
```

By comparing our evaluation metrics, we know that our model is overfitting. Let's mitigate the issue by tuning some of our hyperparameters!
```python
xgb_best
```

```python
xgb_best = Pipeline(steps=[('scaler', StandardScaler()),
                           ('xgb', XGBClassifier(learning_rate=0.2, max_depth=5,
                                                 min_child_weight=2, gamma=2.6,
                                                 n_estimators=25, subsample=0.7))])
xgb_best.fit(X_train, y_train)
```

```python
y_train_pred_xgb = xgb_best.predict(X_train)
y_test_pred_xgb = xgb_best.predict(X_test)

print('Train')
print(classification_report(y_train, y_train_pred_xgb))
print()
print('Test')
print(classification_report(y_test, y_test_pred_xgb))

# Plot the tuned pipeline (xgb_best), not the earlier grid-search object
plot_confusion_matrix(xgb_best, X_train, y_train);
plt.title('Train');
plot_confusion_matrix(xgb_best, X_test, y_test);
plt.title('Test');
```

We evaluate our models on their ability to maximize the number of correctly identified churning customers while minimizing the number of non-churning customers misidentified as churners. Again, the F-4 score comes in handy: by weighting recall four times as heavily as precision, it incentivizes the model to prioritize catching churning customers over hedging against flagging loyal ones. This reflects our estimate that retaining a single churning customer returns roughly four times the revenue lost by misclassifying a single non-churning customer. Additional considerations include model complexity/interpretability and prediction time.
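To make the recall weighting concrete, here is a small self-contained sketch (with made-up labels, not SyriaTel data) showing that the textbook F-beta formula with beta=4 matches scikit-learn's `fbeta_score`, and how it rewards a high-recall, low-precision classifier:

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Toy labels (hypothetical): 1 = churn, 0 = no churn.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # flags everyone: perfect recall, poor precision

p = precision_score(y_true, y_pred)  # 0.5
r = recall_score(y_true, y_pred)     # 1.0

# F-beta with beta=4; note beta^2 = 16 is the factor that appears in the formula.
f4_manual = (1 + 4**2) * p * r / (4**2 * p + r)
f4_sklearn = fbeta_score(y_true, y_pred, beta=4)

print(round(f4_manual, 4), round(f4_sklearn, 4))  # both ≈ 0.9444
```

Despite a precision of only 0.5, the F-4 score stays near 0.94 because recall is perfect, which is exactly the behavior we want when missing a churner is far costlier than a false alarm.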
```python
model_types = ['LogReg', 'k-NN', 'GNB', 'DT', 'RF', 'XGBoost']
best_models = [lr_best, knn_best, gnb_best, dt_best, rf_best, xgb_best]
y_preds = [y_test_pred_lr, y_test_pred_knn, y_test_pred_gnb,
           y_test_pred_dt, y_test_pred_rf, y_test_pred_xgb]

accuracies = [accuracy_score(y_test, y_pred) for y_pred in y_preds]
precisions = [precision_score(y_test, y_pred) for y_pred in y_preds]
recalls = [recall_score(y_test, y_pred) for y_pred in y_preds]
# F-beta with beta=4 (recall weighted four times as heavily as precision):
# F4 = (1 + 4^2) * p * r / (4^2 * p + r)
f4s = [17*p*r / (16*p + r) for p, r in zip(precisions, recalls)]
```

```python
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(12, 12))
ax[0, 0].bar(model_types, accuracies);
ax[0, 0].set_title('Accuracy');
ax[0, 1].bar(model_types, precisions);
ax[0, 1].set_title('Precision');
ax[1, 0].bar(model_types, recalls);
ax[1, 0].set_title('Recall');
ax[1, 1].bar(model_types, f4s);
ax[1, 1].set_title('F4');
fig.text(0.5, 0.04, 'Models', ha='center', fontsize=14);
fig.text(0.04, 0.5, 'Scores', va='center', rotation='vertical', fontsize=14);
plt.show()
```

Based solely on our metrics, we would expect XGBoost to perform best in the field. However, Random Forest and k-NN also perform well. We eliminate k-NN because of its lack of interpretability. Between Random Forest and XGBoost, the better model would be decided by retention campaign metrics, so we cannot say for certain which would serve SyriaTel better until we know more about its client base. Either will perform well enough to give SyriaTel an edge over its competition and lead to an increase in overall revenue.
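Because the Random Forest vs. XGBoost decision hinges on retention campaign economics, a back-of-the-envelope expected-value comparison could look like the sketch below. All numbers here (revenue saved per retained churner, contact cost, save rate, and the confusion counts) are hypothetical placeholders, not SyriaTel figures:

```python
def campaign_net_value(tp, fp, value_per_save=400.0, contact_cost=100.0, save_rate=0.35):
    """Expected net revenue of contacting every customer a model flags as a churner.

    tp: churners correctly flagged (a fraction of them are saved by the campaign)
    fp: loyal customers flagged (we only pay the contact cost for these)
    All dollar amounts and rates are illustrative assumptions.
    """
    return tp * (save_rate * value_per_save - contact_cost) - fp * contact_cost

# Hypothetical test-set confusion counts for the two finalists
rf_net = campaign_net_value(tp=85, fp=20)    # 85*(140-100) - 2000 ≈ 1400
xgb_net = campaign_net_value(tp=92, fp=35)   # 92*(140-100) - 3500 ≈ 180
print(rf_net, xgb_net)
```

Under these assumptions, the model with more true positives is not automatically the better business choice; once SyriaTel's real campaign numbers are known, the same arithmetic picks the winner.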
There were several leads that we did not investigate. For example, we did not explore every combination of our variables per model, and early on we ruled out, on pragmatic grounds, some possibilities that could have yielded significant results. We consider our approach the one most likely to yield the best results given our time constraints. We also encountered several cases of potential overfitting that we may not have addressed sufficiently. Even so, we believe that we performed as exhaustive a search as we could have.

That said, we believe that our final model can boost overall revenues for SyriaTel by a company-changing amount. With our model, the future is brighter than it looked yesterday.
We absolutely must pursue several leads in order to fuel optimal model development:
<ul>
  <li><b>Find more robust churn predictors</b>, such as dropped calls per customer or internet download speed per region. If we know why people are churning, we can do a better job addressing churn.</li>
  <li><b>Analyze promotional success</b>: we need to know the types and success rates of various promotions so that we can better calibrate our model for overall revenue.</li>
  <li><b>Market research</b>: we need to know why customers are churning and which companies they move to after they churn. That way, we know where we can improve.</li>
  <li><b>Customer surveys</b>: we need to know on a case-by-case level how customers use our service. That way, we know what to offer each customer if they show warning signs of churn.</li>
</ul>

Overall, there is still much work to be done. We look forward to working with SyriaTel through these challenges in order to maximize overall revenue.